 original data


DataSIR: A Benchmark Dataset for Sensitive Information Recognition

Neural Information Processing Systems

A.1 Comparison of Results for Gemini with Different Format Transformations. Gemini attained the best performance on the sensitive-category and format-transformation tasks, surpassing all comparator models in maximum achievable performance. The focus was then placed on Gemini's ability to recognize and restore both original and transformed data. The experimental results are shown in Table 1. In the Experiments section of the main text, due to space constraints, only four key observations were analyzed, as follows: i) The LRAcc and DRAcc of the format-transformed data overall are lower than those of the original data, which indicates that data is more difficult to recognize and restore after format transformation. These transformations only affect numbers, and only the IMEI and IMSI (purely numeric) sensitive categories support such transformations. Due to the lack of contextual information in the sample data, large language models may confuse these with personal identifiers, mobile numbers, and MEID.


Hyperbolic Dataset Distillation

Neural Information Processing Systems

To address the computational and storage challenges posed by large-scale datasets in deep learning, dataset distillation has been proposed to synthesize a compact dataset that replaces the original while maintaining comparable model performance. Unlike optimization-based approaches that require costly bi-level optimization, distribution matching (DM) methods improve efficiency by aligning the distributions of synthetic and original data, thereby eliminating nested optimization. DM achieves high computational efficiency and has emerged as a promising solution. However, existing DM methods, constrained to Euclidean space, treat data as independent and identically distributed points, overlooking complex geometric and hierarchical relationships. To overcome this limitation, we propose a novel hyperbolic dataset distillation method, termed HDD.
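The Euclidean distribution-matching idea described above can be illustrated with a toy sketch. This is not the HDD method and not any paper's implementation: it assumes an identity feature embedding, matches only the sample means of the two sets, and uses hypothetical helper names (`mean_vector`, `dm_loss`, `dm_step`) and made-up toy data.

```python
import random


def mean_vector(points):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(points)
    dim = len(points[0])
    return [sum(p[d] for p in points) / n for d in range(dim)]


def dm_loss(real, synthetic):
    """Squared Euclidean distance between the two sample means."""
    mu_r = mean_vector(real)
    mu_s = mean_vector(synthetic)
    return sum((a - b) ** 2 for a, b in zip(mu_r, mu_s))


def dm_step(real, synthetic, lr=0.5):
    """One gradient-descent step on the synthetic points.

    For every synthetic point s_i, d loss / d s_i = (2 / m) * (mu_s - mu_r),
    so a single loop suffices: no inner model training is needed.
    """
    m = len(synthetic)
    mu_r = mean_vector(real)
    mu_s = mean_vector(synthetic)
    grad = [2.0 / m * (s - r) for s, r in zip(mu_s, mu_r)]
    return [[x - lr * g for x, g in zip(p, grad)] for p in synthetic]


# Toy data (assumption): 200 "original" points and 5 synthetic points in 2-D.
rng = random.Random(0)
real = [[rng.gauss(3.0, 1.0), rng.gauss(-1.0, 1.0)] for _ in range(200)]
synthetic = [[rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)] for _ in range(5)]

initial_loss = dm_loss(real, synthetic)
for _ in range(100):
    synthetic = dm_step(real, synthetic, lr=0.5)
final_loss = dm_loss(real, synthetic)
```

Matching only the mean treats the points as an unstructured cloud, which is the kind of limitation the abstract attributes to Euclidean DM methods.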


Discovering and Overcoming Limitations of Noise-engineered Data-free Knowledge Distillation

Neural Information Processing Systems

Distillation in neural networks using only samples randomly drawn from a Gaussian distribution is possibly the most straightforward solution one can think of for the complex problem of knowledge transfer from one network (teacher) to another (student). If done successfully, it can eliminate the need for the teacher's training data in knowledge distillation and avoid the privacy concerns that often arise in sensitive applications such as healthcare. There have been some recent attempts at Gaussian noise-based data-free knowledge distillation; however, none of them offers a consistent or reliable solution. We identify the shift in the distribution of hidden-layer activations as the key limiting factor, which occurs when Gaussian noise is fed to the teacher network instead of the training data it is accustomed to. We propose a simple solution to mitigate this shift and show that for vision tasks, such as classification, it is possible to achieve performance close to the teacher's by using only samples randomly drawn from a Gaussian distribution.
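The activation shift mentioned above can be made concrete with a toy sketch. It does not reproduce the paper's networks or its proposed fix: the single ReLU unit, its weights, and the stand-in "training-like" input distribution are all assumptions chosen for illustration, and the function names are hypothetical.

```python
import random

# Hypothetical fixed weights and bias for one hidden ReLU unit (assumption).
WEIGHTS = [0.5, -0.25, 1.0]
BIAS = 0.1


def relu_activation(x):
    """Output of one hidden unit: ReLU(w . x + b)."""
    pre = sum(w * xi for w, xi in zip(WEIGHTS, x)) + BIAS
    return max(0.0, pre)


def mean_activation(inputs):
    """Average activation of the unit over a batch of inputs."""
    return sum(relu_activation(x) for x in inputs) / len(inputs)


rng = random.Random(0)
n_samples = 2000

# Stand-in for the data the teacher was trained on (assumption):
# each feature is drawn from N(2.0, 0.5^2).
train_like = [[rng.gauss(2.0, 0.5) for _ in WEIGHTS] for _ in range(n_samples)]

# Standard Gaussian noise fed to the same unit.
noise = [[rng.gauss(0.0, 1.0) for _ in WEIGHTS] for _ in range(n_samples)]

mean_on_data = mean_activation(train_like)
mean_on_noise = mean_activation(noise)
shift = abs(mean_on_data - mean_on_noise)
print(f"mean activation on data:  {mean_on_data:.3f}")
print(f"mean activation on noise: {mean_on_noise:.3f}")
print(f"shift: {shift:.3f}")
```

In this toy setting the unit sees very different activation statistics under noise than under the inputs it was assumed to be trained on, which is the kind of mismatch the abstract identifies as the limiting factor.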




Appendix 1 Back imagination and Back speech

Neural Information Processing Systems

Figure 1: The illustrative examples for two proposed techniques: Back-imagination and Back-speech. Tiny ImageNet [Le and Yang, 2015] serves as a compact version of the comprehensive ImageNet dataset. The Stanford Sentiment Treebank-2 (SST-2) [Socher et al., 2013] is a sentiment classification dataset. Given the scarcity of datasets for understanding natural language in visual scenes, we introduce a novel textual entailment dataset, named Textual Natural Contextual Classification (TNCC). This dataset is formulated on the foundation of Crisscrossed Captions [Parekh et al., 2020], an image In this work, we employ a uniform experimental configuration for both textual entailment and sentiment classification tasks. For the image classification task, we employ the ResNet18 [He et al., 2015] model, which is considered more suitable for small datasets.


Supplementary Material A Data Modeling

Neural Information Processing Systems

In this section, we provide further details for our data modeling. We note the difficulty of appropriately modeling the terminal variable, which is binary, whereas the rest of the dimensions are continuous for the environments we investigate. This is particularly challenging for "expert" datasets where early termination is rare. An immediate advantage of sampling data from a generative model is compression. As we discuss in Appendix B.3, sampling is fast ER provides high levels of dataset compression without sacrificing downstream performance in offline reinforcement learning.